forked from ggml-org/llama.cpp
merge from upstream #73
Merged
Conversation
* ggml : add ggml_set_rows

  Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

  ref: ggml-org#8366

* use I64 for indices
* ggml : add repeat impl for i64
* ggml : add ggml_is_contiguous_rows
* ggml : ggml_set_rows support broadcast
* ggml : ggml_set_rows support quantized dst ggml-ci
* ggml : support GGML_TYPE_F32 ".from_float" trait
* ggml : ggml_set_rows update comment + better index name
* tests : add ggml_set_rows
* metal : add ggml_set_rows implementation ggml-ci
* ggml : simplify forward_dup_f32
* ggml : fix supports_op
* tests : add comment to set_rows
* ggml : leave the repeat_i64 for a separate PR ggml-ci
* ggml : set_rows use std::min instead of MIN
* ggml : better error message for set_rows unsupported type
* metal : perform op->type check only once
* tests : more consistent implementation + more tests ggml-ci

---------

Co-authored-by: Georgi Gerganov <[email protected]>
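A minimal sketch of the new op in use, assuming the `ggml_set_rows(ctx, a, b, c)` signature described above with I64 indices; the shapes and the helper name are illustrative, not code from the commit:

```c
#include "ggml.h"

// Scatter 4 rows of 'src' into a 16-row destination at the positions given
// by an I64 index tensor: dst[idx[i]] = src[i] for each source row i.
struct ggml_tensor * scatter_rows_example(struct ggml_context * ctx) {
    struct ggml_tensor * dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 16); // 16 rows of 64 floats
    struct ggml_tensor * src = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64,  4); //  4 rows to write
    struct ggml_tensor * idx = ggml_new_tensor_1d(ctx, GGML_TYPE_I64,  4);     // one dst row index per src row

    return ggml_set_rows(ctx, dst, src, idx);
}
```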
This setting needs to be passed through to vulkan-shaders-gen
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <[email protected]>
* vulkan: Add fusion support for RMS_NORM+MUL

  - Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
  - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
  - Add detection logic and basic fusion logic in ggml-vulkan.
  - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.

* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups, style fixes
* last node doesn't need single use; fix type; handle mul operands being swapped
* remove redundant parameter

---------

Co-authored-by: slaren <[email protected]>
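A rough sketch of the detection logic described above; the helper name and the per-node `use_count` array are hypothetical stand-ins for the per-graph counters this PR adds (the real ones are indexed by hash-table slot):

```c
#include "ggml.h"
#include <stdbool.h>

// Fuse RMS_NORM followed by MUL only when the norm's output feeds exactly
// one consumer and that consumer is the MUL (either operand, since the
// operands may be swapped).
static bool can_fuse_rms_norm_mul(struct ggml_cgraph * graph, int i, const int * use_count) {
    if (i + 1 >= ggml_graph_n_nodes(graph)) {
        return false;
    }
    struct ggml_tensor * norm = ggml_graph_node(graph, i);
    struct ggml_tensor * mul  = ggml_graph_node(graph, i + 1);

    if (norm->op != GGML_OP_RMS_NORM || mul->op != GGML_OP_MUL) {
        return false;
    }
    if (use_count[i] != 1) {
        return false; // output used elsewhere: fusing would skip a consumer
    }
    return mul->src[0] == norm || mul->src[1] == norm;
}
```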
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci] ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (ggml-org#14181)

  - implement GLU for split up/gate
  - add tests for ggml_glu_split
  - Vulkan: Implement glu_split logic and shader support
  - add split to logging [no ci]
  - SYCL: refactor element_size ops and add split up and gate support to gated kernels
  - SYCL: switch GEGLU to use tanh approximation

  Co-authored-by: 0cc4m <[email protected]>
  Co-authored-by: Akarshan <[email protected]>

* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining

  This commit refactors the SYCL element-wise operations to improve performance by:

  - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
  - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
  - Replacing direct kernel calls with calls to these inlined functions.
  - Using `__dpct_inline__` to encourage compiler inlining.
  - Minor code cleanup and consistency improvements.

  The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (ggml-org#14345)

  - vulkan: Increase workgroup size for GLU, for performance
  - vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix
* metal : add support for split and swap ggml-ci

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Akarshan <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
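For reference, the element-wise forms of the three gated ops, written here for a split up/gate pair (the unary variants split a single tensor in half instead); an illustrative sketch, not the SIMD kernels from the commit:

```c
#include <math.h>

static float op_relu(float x) { return x > 0.0f ? x : 0.0f; }
static float op_silu(float x) { return x / (1.0f + expf(-x)); }
// tanh approximation of GELU, matching the SYCL switch mentioned above
static float op_gelu(float x) { return 0.5f*x*(1.0f + tanhf(0.79788456f*(x + 0.044715f*x*x*x))); }

static float reglu (float gate, float up) { return op_relu(gate) * up; }
static float geglu (float gate, float up) { return op_gelu(gate) * up; }
static float swiglu(float gate, float up) { return op_silu(gate) * up; }
```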
* SYCL: disable faulty fp16 CPU exponent for now
* Revert "SYCL: disable faulty fp16 CPU exponent for now"

  This reverts commit ed0aab1.

* SYCL: disable faulty fp16 CPU exponent for now
* Fix logic of disabling exponent kernel
…eature), from command line and from client (ggml-org#13196)

* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp

  Co-authored-by: Georgi Gerganov <[email protected]>

* coding standard: common/chat.cpp

  Co-authored-by: Georgi Gerganov <[email protected]>

* Apply suggestions from code review

  coding standard: cosmetic changes

  Co-authored-by: Georgi Gerganov <[email protected]>

* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Olivier Chafik <[email protected]>
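Usage, as best it can be reconstructed from the commits above (flag spelling per this PR; verify against `llama-server --help` on your build):

```console
# pass extra Jinja kwargs to the chat template from the command line,
# e.g. Qwen3's enable_thinking switch
llama-server -m model.gguf --chat-template-kwargs '{"enable_thinking": false}'
```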
* Update docker.yml

  Modify docker.yml so that this workflow stops running on a schedule; to run the workflow, start it manually.

* Remove redundant include path in CMakeLists.txt

  The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

* Enable scheduled Docker image builds

  Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
* metal : disable fast-math for some cpy kernels ggml-ci
* cont : disable for q4_1 ggml-ci
* cont : disable for iq4_nl ggml-ci
* Conv2D: Add CPU version
* Half decent
* Tiled approach for F32
* remove file
* Fix tests
* Support F16 operations
* add assert about size
* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
This commit renames the variable `best_mad` to `best_error` in the `make_qkx2_quants` function. The motivation for this is that the name `best_mad` can be somewhat confusing if mean absolute deviation (MAD) is not in use.
* add "align corners" mode for bilinear upscale, and allow downscaling * add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag * test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
ggml-ci
…amic libraries search for dependencies in their origin directory. (ggml-org#14309)
* ggml : add version function to get lib version
This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.
Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```
* ggml : add ggml_commit()
---------
Co-authored-by: Georgi Gerganov <[email protected]>
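The companion `ggml_commit()` can presumably be used the same way; a minimal sketch, assuming it also returns a constant string (check ggml.h on your build):

```c
printf("GGML commit: %s\n", ggml_commit());
```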
ggml-ci
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

  The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN

  The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

  Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL

  This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

  This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy

  And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies

  Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2

  There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba

  Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON.

* cuda : graceful fallback for Mamba-1 models with weird embd size
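For orientation, the recurrence these SSM_SCAN kernels implement, reduced to a single channel (illustrative; the real kernels are batched and SIMD-ized, and per the commit above the D*x skip connection is now applied outside the op):

```c
#include <math.h>

// One scan step: decay the hidden state, inject the input, project to output.
// h has n_state elements; A, B, C are the per-step SSM parameters.
static float ssm_scan_step(float * h, const float * A, const float * B,
                           const float * C, float dt, float x, int n_state) {
    float y = 0.0f;
    for (int i = 0; i < n_state; i++) {
        h[i] = expf(dt * A[i]) * h[i] + dt * B[i] * x;
        y   += C[i] * h[i];
    }
    return y; // caller adds D*x separately
}
```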
* add support for chat template jinja files
* remove gemma3n hack
* ggml : fix FA mask dim 2 and 3 ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan ggml-ci
* vulkan : disable FA for mask->ne[2] != 1
* kv-cache : use ggml_set_rows ggml-ci
* graph : separate k and v indices ggml-ci
* cont : remove redundant ifs ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci
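Conceptually, the cache write becomes a row scatter through an index tensor rather than a copy into a fixed-offset view; a hedged sketch (tensor names and shapes are illustrative, not the actual llama.cpp code):

```c
#include "ggml.h"

// Write the current batch's K rows into the cache at the cells chosen by
// find_slot. 'k_idx' is an I64 tensor of destination row indices, filled
// per batch; the V cache gets its own separate index tensor.
struct ggml_tensor * kv_write_k(struct ggml_context * ctx,
                                struct ggml_tensor  * k_cache, // [n_embd_k, kv_size]
                                struct ggml_tensor  * k_cur,   // [n_embd_k, n_tokens]
                                struct ggml_tensor  * k_idx) { // [n_tokens], I64
    return ggml_set_rows(ctx, k_cache, k_cur, k_idx);
}
```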
* convert : correct gemma 3n conversion
* rm redundant code
…g#14504)

Signed-off-by: nscipione <[email protected]>
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
…g#14002)

Co-authored-by: luyuhong <[email protected]>
…14368)

* test-backend-ops: add support for specifying output format
* Address review comments
* Add build_commit and build_number in test_result
* Address review comments
* refactor
* Get build commit from ggml_commit()
* Merge errors into test_operation_info && address review comments
* Address review comments
* Address review comments
* remove visitor nonsense
* remove visitor comment
* Address review comments

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: slaren <[email protected]>
* vulkan: Handle updated FA dim2/3 definition

  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.

* handle null mask for gqa
* allow gqa with dim3>1
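The packing itself is plain bit manipulation; the layout below is an illustrative sketch, not necessarily the shader's exact encoding:

```c
#include <stdint.h>

// Pack a boolean and a small integer into one 32-bit push-constant word.
static uint32_t pack_mask_nhl2(int has_mask, uint32_t n_head_log2) {
    return (has_mask ? 1u : 0u) | (n_head_log2 << 1);
}

static int      unpack_has_mask(uint32_t w) { return (int)(w & 1u); }
static uint32_t unpack_nhl2(uint32_t w)     { return w >> 1; }
```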
The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.
Commit taken from remyoudompheng's PR ggml-org#12260

Co-authored-by: Rémy Oudompheng <[email protected]>